Single-cell transcriptomics highlights immunological dysregulations of monocytes in the pathobiology of COPD

Background Chronic obstructive pulmonary disease (COPD) is a common respiratory disease, whose pathogenetic complexity was strongly associated with aging/smoking and poorly understood. Methods Here we performed single-cell RNA sequencing (scRNA-seq) analysis of 66,610 cells from COPD and age-stratified control lung tissues of donors with different smoking histories to prioritize cell types most perturbed in COPD lungs in aging/smoking dependent or independent manner. By performing an array of advanced bioinformatic analyses, such as gene set enrichment analysis, trajectory analysis, cell–cell interactions analysis, regulatory potential analysis, weighted correlation network analysis, functional interaction analysis, and gene set variation analysis, we integrated cell-type-level alterations into a system-level malfunction and provided a more clarified COPD pathological model containing specific mechanisms by which aging and smoking facilitate COPD development. Finally, we integrated the publicly available scRNA-seq data of 9 individuals, resulting in a total of 110,931 cells, and replicated the analyses to enhance the credibility of our findings. Results Our study pointed to enrichment of COPD molecular alteration in monocytes, which further induced a previously unrecognized pro-inflammatory effect on alveolar epithelial cells. In addition, aged monocytes and club cells facilitated COPD development via maintaining an autoimmune airway niche. Unexpectedly, macrophages, whose defect to resolve inflammation was long-recognized in COPD pathogenesis, primarily induced an imbalance of sphingolipids rheostat in a smoking-dependent way. These findings were validated in a meta-analysis including other public single-cell transcriptomic data. Conclusions In sum, our study provided a clarified view of COPD pathogenesis and demonstrated the potential of targeting monocytes in COPD diagnosis and treatment. Supplementary Information The online version contains supplementary material available at 10.1186/s12931-022-02293-2.

profiles was conducted by uniform manifold approximation and projection (UMAP).
The Louvain modularity optimization algorithm was applied to iteratively group cells together into clusters for cell type identification. Cell clusters were annotated to known biological cell types using canonical cell marker genes.

Differential gene expression analysis
To identify genes differentially expressed under three conditions (aging, smoking, and COPD) in each cell type, MAST (1.14.0) 3 was used to perform zero-inflated regression analysis by fitting a linear mixed model. We controlled both technical variation and individual variation by using a two-part hurdle model. The following model was fit with MAST: zlm( ~ COPD+ age + (1|individual) + smoking + percent.mt + nCount_RNA , sca, method='glmer', ebayes=FALSE) Where nCount_RNA was the total number of molecules detected within a cell and percent.mt was mitochondria fraction. To identify genes differentially expressed due to the disease effect, likelihood ratio test was performed by comparing the model with and without the COPD variable. Multiple hypothesis testing was performed with Bonferroni and Holm corrections. For COPD/smoking variable, genes with absolute value of log2(fold change) > 0.14 (10% difference in expression) and FDR < 0.05 between conditions were selected as differentially expressed genes (DEGs). For aging variable, we treat aging as a continuous covariate while controlling for technical 4 variation and individual variation. Genes were considered as significantly differentially expressed with an FDR < 0.05 and an age coefficient > 0.2 or < -0.2.

Enrichment analysis of gene sets
Gene Ontology Gene enrichment was performed using the R package clusterProfiler (version 3.14.3) 4 . Enrichment in GO Biological process between DEGs of COPD of each cell type was analyzed using the compareCluster. Significantly enriched GO terms were simplified using simplify function remove highly similar terms (cutoff = 0.7) by retaining the most significant representative term. Enrichment in GO Biological process on DEGs of aging and smoking were analyzed as above.
Enrichment analysis using custom genes sets was performed using the appropriate background lists (genes detected >10 % of cells in each cell type). Enrichment of COPD-associated DEGs at COPD GWAS risk genes 5 was performed using Fisher's exact test to determine cell types contributing to COPD heritability.

Downsampling analysis to identify condition-relevant cell types
To calculate the number of COPD-associated DEGs across cell types normalized by the number of cells in each cell type, we randomly drew 700 cells from each cell type before performing differential gene expression analysis. This analysis was repeated 10 times and the number of DEGs for each cell type were presented as boxplots. Agingassociated DEGs and smoking-associated DEGs were performed as above.

Cell type prioritization analysis
To delineate the cell type-specific responses to different biological conditions (aging, smoking, and COPD), cell type prioritization analysis in three conditions was performed using 'calculate_auc' function implemented in the Augur R package with default parameters (version 1.0.0) 6 . As input, we used the gene expression counts that were normalized with a scale factor of 10,000 UMIs per cell and subsequently natural log-transformed. Different conditions were analyzed separately. COPD and smoking cell type prioritization scores from augur analysis were AUC, and aging cell type prioritization score from augur analysis was CCC. The cell type prioritization score was displayed in heatmap.

Sub-clustering and identification of markers of sub-cluster
To identify sub-clusters within monocytes, we reanalyzed cells belonging to monocytes with a resolution of 0.2, and three sub-clusters of monocytes were identified: CD14+ classical monocytes, CD14+ intermediate monocytes and CD16+ non-classical monocytes. we further clustered our club cells with a resolution of 0.1 and generated 2 sub-clusters: mix sub-cluster, autoimmune-prone sub-cluster.
To distinguish sub-clusters within dendritic cells, we reanalyzed cells belonging to dendritic cells with a resolution of 0.2. According to the gene expression of each subcluster, we obtained five dendritic clusters, including IGSF2+ DC, TREM2+ DC, 6 myeloid DC1 (mDC1), myeloid DC2 (mDC2), plasmacytoid DC (pDC). To identify sub-clusters within macrophage, we reanalyzed cells belonging to macrophage with a resolution of 0.2, and two sub-clusters of macrophage were identified: one FABP4+ macrophage and one FABP4-macrophage.
We then applied the Seurat FindAllMarkers to generate cell type sub-cluster marker.

Gene set enrichment analysis (GSEA)
To determine whether differential expression genes within a given cell type were significantly enriched with meaningful biological processes, gene set enrichment analysis (GSEA) was done using the clusterProfiler R package (3.12.0) 4 . Only genes detected in >10% of cells of a given cell type were evaluated. All annotated gene sets were collected from the Molecular Signatures Database (MSigDB: C5: ontology gene sets). Gene set with a P-value less than 0.05 was considered as significantly enriched.

Trajectory analysis
To identify lineage trajectories, we used Slingshot (version 1.4.0) 7 . The cluster representing CD14+ classical monocytes, AT2s, and club cells were chosen as the root node separately to construct lineages of differentiation to other cell types based on PCA dimension reduction and estimate pseudotime of each cell along the lineages.
After running slingshot, we used the tradeSeq package (version 1.4.0) 8 to find genes dynamically expressed during cell differentiation. For each HVG, we fit a general 7 additive model (GAM) using a negative binomial noise distribution to model the relationships between gene expression and pseudotime and tested for significant associations between expression and pseudotime using the associationTest. We picked out the top genes based on wald statistics and visualized their expression over developmental time with a heatmap.

Monocytes differentiation driver genes enrichment analysis
To identify lineage trajectories of monocytes sub-clusters, we performed trajectory inference analysis using Slingshot. The cluster representing CD14+ classical monocytes was chosen as the root node to construct lineages of differentiation to CD14+ intermediate monocytes. we fit a general additive model (GAM) using a negative binomial noise distribution to model the relationships between gene expression and pseudotime and tested for significant associations between expression and pseudotime using the associationTest. We ranked genes according to wald statistics (the gene with the largest waldStat is ranked the top) and picked out the top 250 genes based on wald statistics as monocytes differentiation driver genes. To investigate whether DEG sunder three conditions (COPD, aging, smoking) in monocytes are enriched in the pre-ranked 250 monocytes differentiation driver genes, we performed Kolmogorov-Smirnov (K-S) test on DEGs of monocytes and visualized with a kolmogorov-smirnov plot. The one-sided (greater) K-S test was used to determine the enrichment score. The nominal p-value estimates the statistical 8 significance of a single gene set's enrichment score, based on the null distribution generated from gene set permutation.

Cell-cell interactions (CCIs) analysis
To infer potentially relevant interactions between two cell populations, CCIs analysis was conducted with the CellPhoneDB software (version 2.0) 9 . For the COPD condition, normalized gene expression matrix of control and COPD patients was first corrected to remove variation due to smoking and aging using WGCNA R package function 'empiricalBayesLM'. The same method is used to eliminate variation due to COPD and smoking in the aging condition and variation due to COPD and aging in the smoking condition. In CCIs analysis, COPD and control were analyzed separately.
Only receptors and ligands detected in >10% of cells in each cell type from COPD were further evaluated, while a CCI was considered nonexistent if the ligand or the receptor was unmeasurable. Averaged expression of each ligand-receptor pair was analyzed between various cell types, and only those whose average expression value was in the top 75% quantile among the ligand-receptor pairs with p-value < 0.05 were used for the prediction of CCIs between any two cell types. The control group was analyzed as above. We quantified the degree of change in cell interactions in the COPD condition by the difference between the number of significant interaction ligand-receptor pairs in COPD and the number of significant interaction ligandreceptor pairs in control. The degree of changes in CCIs under aging (old vs. young 9 group) and smoking (active smoker vs. never-smoker group) conditions was analyzed as above.

Intercellular interaction network of ligands and receptors
To identify and illustrate alterations in the intercellular signaling network, iTALK  11 . The interaction between monocytes and immune cells within COPD was analyzed as above.

Regulatory potential analysis
To speculate the regulatory potential of ligands in monocytes on differential expressed genes in interacting cells, we applied the NicheNet analysis using nichenetr R package (version 1.0.0) 12 . As input genes to infer the ligand activity score, we defined all DEG genes with FDR < 0.05 and log2fc > 0 within a given cell type in COPD condition. As background genes, we defined all genes that were not DEG within a given cell type in COPD conditions and detected in > 10% of cells. For ligand prioritization, we selected the genes with FDR< 0.05 and log2fc > 0.14 in COPD conditions were defined as large change potential_ligands within a given cell type. 11 Genes with FDR < 0.05 in COPD conditions were defined as potential_ligands within a given cell type.

Potential upstream regulators prediction
To predict potential upstream transcriptional regulators of the COPD-associated DEGs in AT1s, AT2s, endothelial cells, stromal cells, club cells, and ciliated cells, we performed enrichment analysis of DEGs using Metascape software 13 . TRRUST ontology categories with a p-value < 0.01, a minimum count of 3, and an enrichment factor > 1.5 in metascape were considered as significant DEG potential upstream regulators. The pathway annotation of upstream regulators was classified according to the pathway types in the KEGG database.

Weighted correlation network analysis (WGCNA)
To identify the co-expression gene modules of alveolar type 2 cells, a weight gene coexpression network analysis was performed using WGCNA R package (version 1.69) 14 . Expression was first corrected to remove variation due to smoking and aging using empiricalBayesLM. The appropriate soft-thresholding power was chosen based on a scale-free topology criterion and the weighted adjacency matrix was constructed using the soft-thresholding power. The relationship between one gene and all other ones in the analysis was incorporated, and the adjacency matrix was transformed into the topological matrix (TOM). The genes demonstrated hierarchical clustering according to the TOM-based dissimilarity (1-TOM) measure, and highly interconnected genes were assigned to the same module. Then, the eigengenes of WGCNA modules were calculated to represent the overall gene expression profiles in each module and then were tested for statistical significance of association with the COPD. Hub genes are defined as genes with high correlation with the remaining genes in the Blue module.

Functional interaction analysis
To better demonstrate the core biological process in COPD-associated AT2s as well as their interrelationships, we used the STRING database 15 to construct a functional interaction network based on the genes which were enriched in more than 2 biological processes in the enrichment map showing the functional topologies among major biological process of AT2s WGCNA Blue module. Network type was physical complex and the edge thickness indicate the strength of data support. Active interaction source was from experiments only. The minimum required interaction score based upon confidence was set as 0.5.

AT2-to-AT1 differentiation driver genes enrichment analysis
To identify lineage trajectories of AT2s to AT1s, we performed trajectory inference analysis using Slingshot. The cluster representing AT2s was chosen as the root node to construct lineages of differentiation to AT1s. We ranked genes according to wald statistics (the gene with the largest waldStat is ranked the top) and picked out the top 13 500 genes based on wald statistics as AT2s-AT1s differentiation driver genes. To investigate whether COPD-, aging-, and smoking-associated DEGs of monocytes are enriched in the 500 AT2s-AT1s differentiation driver genes, we performed Kolmogorov-Smirnov (K-S) test on COPD-, aging-, and smoking-associated DEGs of monocytes and visualized with a kolmogorov-smirnov plot. The one-sided (greater) K-S test was used to determine the enrichment score. The nominal p-value estimates the statistical significance of a single gene set's enrichment score, based on the null distribution generated from gene set permutation.

Gene set variation analysis (GSVA)
To evaluate the activity of gene sets within a given cell type under different conditions, Gene set variation analysis (GSVA) was performed in the GSVA R package (version 1.34.0) 16 . All ontology gene sets were collected from the Molecular Signatures Database (MSigDB). Only genes detected in >10% of cells within a given cell type were further evaluated. As GSVA input, normalized gene expression matrix was first corrected to remove variation due to aging and smoking using WGCNA R package function 'empiricalBayesLM' in each cell type for the COPD condition; normalized gene expression matrix was first corrected to remove variation due to COPD and smoking using WGCNA function 'empiricalBayesLM' in each cell type for the aging condition; normalized gene expression matrix was first corrected to remove variation due to COPD and aging using WGCNA R package 14 function 'empiricalBayesLM' in each cell type for the smoking condition. Differential expression analysis with gene sets enrichment scores as if they were simple gene expression values using the 'lmFit' and 'eBayes' functions from the limma R package. Gene sets with adjust p-value (padjust) < 0.05 were selected as differentially activated gene sets.

Club differentiation driver genes functional enrichment analysis
To identify potential driver genes responsible for club differentiation from the club cells to the ciliated cells, AT1s, and AT2s, we performed trajectory analysis using Slingshot. Club cells was chosen as the root node to construct lineages of differentiation to ciliated cells, to AT1s, and to AT2s. For club cells to ciliated cells, we ranked genes according to wald statistics (the gene with the largest waldStat is ranked the top) and picked out the top 100 genes based on wald statistics as potential driver genes responsible for club cells differentiation from the club cells to the ciliated cells. Finally, we performed functional enrichment analysis on club cells differentiation driver genes using Metascape software. Club cell-to-AT1 and club cellto-AT2 differentiation driver genes were identified as above.

Identification of macrophage subtypes
To identify macrophage subtype, we selected top 400 biological process terms of macrophage with largest standard deviation of GSVA enrichment score. K-means 15 clustering with k = 4 was performed on top 400 biological process GSVA enrichment score of macrophages, and 3 subtypes of macrophage were identified: k-means cluster1 and k-means cluster2 defined as subtype A, k-means cluster3 as subtype B and k-means cluster4 as subtype C.

Correlation analysis based on summarized expression profiles at the individual level
We calculated cell type-specific mean expression within each individual. As a result, each individual has an expression matrix with rows corresponding to genes and columns corresponding to cell types (15 columns in total). To estimate expression correlations between cell types across individuals, for example "A" gene in macrophages and "B" gene in endothelial cells, we actually performed Spearman correlation analysis between mean expression values of A genes of 9 individuals in macrophages and mean expression values of B genes of 9 individuals in endothelial cells. Highly correlated genes are defined as those with Spearman correlation coefficient ≥ 0.8 or ≤ -0.8.

Integration of publicly available scRNA-seq data
To assess the reliability of the three conditions (COPD, aging, and smoking)associated DEGs in monocytes in our dataset, we collected COPD-associated DEGs in monocytes from the scRNA-seq dataset of Li and his colleagues 17 . Since they 16 neither set young controls nor analyzed smoking separately, both aging-and smokingassociated DEGs of monocytes were not available in their datasets but all mixed with COPD-associated DEGs. GO enrichment analysis was performed across our three conditions (COPD, aging, and smoking)-associated DEGs of monocytes and COPDassociated DEGs of Li and his colleagues' via Metascape database 13 . GO biological process terms with moderate p-value were selected to eliminate some biological processes too broad to reflect the specific pathological role and displayed in a heatmap. To further assess the reliability of the three conditions (COPD, aging, and smoking)-associated DEGs in CD14+ monocytes in our dataset, we collected COPDassociated DEGs in CD14+ monocytes from the scRNA-seq dataset of Adams and his colleagues 18 . GO enrichment analysis across these multiple DEGs were performed and visualized as above.
Data from GEO accession number GSE136831, representing scRNA-seq raw count data from 6 control lung homogenates and 3 transplant stage COPD lungs, were used.
Samples' clinical information are given in Dataset 7.xlsx.
To validate the reproducibility of our study, all cells of 9 samples in our study and those of 9 samples in Adams' study were combined as the "9 + 9 dataset".
To identify genes differentially expressed under three conditions (aging, smoking, and COPD) within each cell type of the 9 + 9 dataset, MAST was used to perform zeroinflated regression analysis by fitting a linear mixed model. Considering the batch effect of the 9 + 9 dataset, our 9 samples were defined as batch 1 data, while Adams' 17 9 samples were defined as batch 2 data. We controlled both technical variation and individual variation by using a two-part hurdle model. The following model was fit with MAST: zlm( ~ COPD+ age + (1|individual) + smoking + batch + sex+ percent.mt + nCount_RNA , sca, method='glmer', ebayes=FALSE) For COPD/smoking variable, genes with absolute value of log2(fold change) > 0.14 (10% difference in expression) and FDR < 0.05 between conditions were selected as differentially expressed genes (DEGs). For aging variable, we treat aging as a continuous covariate while controlling for technical variation and individual variation.
Genes were considered as significantly differentially expressed with an FDR < 0.05 and an age coefficient > 0.2 or < -0.2.
To calculate the number of COPD-associated DEGs across cell types normalized by the number of cells in each cell type of the 9 + 9 dataset, we randomly drew 2000 cells from each cell type before performing differential gene expression analysis. This analysis was repeated 10 times and the number of DEGs for each cell type were presented as boxplots. Aging-associated DEGs and smoking-associated DEGs were performed as above. Since no neutrophil cells were annotated as neutrophils in Adams' data, neutrophils were not included in downsampling analysis of the 9 + 9 dataset.
To verify the reliability of GSEA analysis results in our study, we applied the differential expression analysis results of the 9 + 9 dataset to conduct GSEA analysis, and the analysis process was as described in the previous gene set enrichment analysis.

Preparation of aqueous cigarette smoke extract (CSE)
Huangshan Brand cigarettes were manufactured by China Tobacco Anhui Industrial Co., Ltd. Tar, nicotine, and carbon monoxide contents of Huangshan Brand cigarettes were 10 mg/cigarette, 0.9 mg/cigarette, and 11 mg/cigarette, respectively. CSE was prepared by bubbling smoke from 1 cigarette into 10 ml of RPMI medium 1640 (Gibco, Burlington, ON, USA) at a rate of 1 cigarette per 2 minutes, as described previously 19 . The pH of CSE was adjusted to 7.2-7.4. After being filtered through a sterile 0.22 μm filter (Merck&Millipore,Darmstadt, Germany), some CSE preparation was diluted 7 times with RPMI medium 1640 and used to monitor the absorbance at 320 nm (optical density of 0.67 ± 0.01). CSE was freshly prepared for each experiment and diluted with culture medium immediately before use. Control medium was prepared by bubbling air through 10 ml of culture medium, adjusting pH to 7.4, and sterile filtering as described for CSE preparation.

RNA extraction and RT-PCR assay
Total RNA was extracted from macrophages using Trizol reagent (Invitrogen, Grand Island, NY, USA) as previously described 21 . RNA was reverse-transcribed into cDNA with HiScript II Q RT SuperMix for qPCR (Vazyme, Nanjing, Jiangsu, China).

Statistical analysis
For the in vitro experiment related to Fig. 7E, at least three independent experiments were performed. Comparisons were performed using the Student's t test between two groups. Results were presented as means ± SEM. A p-value < 0.05 was considered statistically significant.